feat(orchestrator): per-env sample strategy + env-mix seam by hallerite · Pull Request #2722 · PrimeIntellect-ai/prime-rl

hallerite · 2026-06-05T14:23:20Z

Stacked on #2721 (feat/per-env-advantage). This PR is based on that branch; review and merge #2721 first.

What

Introduce the per-env sampling seam: each training env owns a SampleStrategy (what example to serve, plus an observe() feedback hook), and env selection is delegated to a swappable EnvMixStrategy. Defaults reproduce today's behavior; this is the foundation for curriculum / replay samplers.

Why

TrainSource previously owned both dataset iteration and env selection in a single class, with no way to (a) plug in a different per-env example-selection policy or (b) feed rollout outcomes back to the sampler. Splitting these into per-env and global strategies, and routing scored groups back via observe(), is what makes curriculum learning and (later) replay expressible without touching the dispatcher/perf path.
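To make the feedback idea concrete, here is a minimal sketch of how a curriculum-style sampler could use observe(). It is not code from this PR: the `next()` method name, the `LowRewardFirstSampler` class, the `id` / `example_id` / `reward` keys, and the assumption that rewards lie in [0, 1] are all hypothetical and chosen only for illustration.

```python
import random
from abc import ABC, abstractmethod


class SampleStrategy(ABC):
    """Per-env policy: which example to serve, plus a feedback hook.

    Only observe() is named in the PR; next() is an assumed method name.
    """

    @abstractmethod
    def next(self) -> dict:
        """Return the next example for this env."""

    def observe(self, rollouts: list[dict]) -> None:
        """Receive scored rollouts. The default ignores them (no-op)."""


class LowRewardFirstSampler(SampleStrategy):
    """Hypothetical curriculum sampler: favor examples with low mean reward."""

    def __init__(self, rows: list[dict], seed: int = 0, floor: float = 0.05) -> None:
        if not rows:
            raise ValueError("rows must be non-empty")
        self.rows = rows
        self.rng = random.Random(seed)
        self.floor = floor  # keeps every example reachable
        # Running mean reward and observation count per example id.
        self.mean_reward = {row["id"]: 0.0 for row in rows}
        self.count = {row["id"]: 0 for row in rows}

    def next(self) -> dict:
        # Weight = (1 - mean reward), clamped at 0, plus a small floor.
        weights = [
            max(0.0, 1.0 - self.mean_reward[row["id"]]) + self.floor
            for row in self.rows
        ]
        return self.rng.choices(self.rows, weights=weights, k=1)[0]

    def observe(self, rollouts: list[dict]) -> None:
        # Incremental mean update for each scored rollout.
        for rollout in rollouts:
            ex_id = rollout["example_id"]
            self.count[ex_id] += 1
            delta = rollout["reward"] - self.mean_reward[ex_id]
            self.mean_reward[ex_id] += delta / self.count[ex_id]


sampler = LowRewardFirstSampler([{"id": 0}, {"id": 1}], seed=0)
sampler.observe([{"example_id": 0, "reward": 1.0}])  # example 0 is "solved"
print(sampler.next())  # example 1 is now much more likely to be served
```

Because the sampler only sees scored groups through observe(), a policy like this can be swapped in per env without the dispatcher knowing anything about it.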

Changes

- orchestrator/sampling.py (new): SampleStrategy ABC + ShuffledCursorSampler default (per-env: shuffle rows once, walk a reshuffling cursor); EnvMixStrategy ABC + WeightedRoundRobin default (which env next).
- TrainEnv now owns its dataset via build_sampler() and holds a .sampler, reachable by both the source (pull) and the sink (observe).
- TrainSource shrinks to: build per-env samplers + EnvMixStrategy; next_example picks an env then pulls from that env's sampler. (Folds in the env-mix extraction, the "slice b" seam.)
- TrainSink.process_group calls env.sampler.observe(survivors) after advantages are assigned. This is the feedback wire (no-op for the default sampler).
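The changes above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code in orchestrator/sampling.py: the `next()` and `next_env()` method names, the constructor signatures, and the dict-of-samplers wiring are hypothetical, the ABC base classes are omitted for brevity, and "weighted round-robin" is assumed here to mean a seeded weighted random draw.

```python
import random


class ShuffledCursorSampler:
    """Shuffle rows once, walk them with a cursor, reshuffle on wrap-around."""

    def __init__(self, rows: list[dict], seed: int = 0) -> None:
        if not rows:
            raise ValueError("ShuffledCursorSampler needs at least one row")
        self.rows = rows
        self.rng = random.Random(seed)
        self.order = list(range(len(rows)))
        self.rng.shuffle(self.order)
        self.cursor = 0

    def next(self) -> dict:
        # When the cursor has passed every row, reshuffle and start over.
        if self.cursor == len(self.order):
            self.rng.shuffle(self.order)
            self.cursor = 0
        row = self.rows[self.order[self.cursor]]
        self.cursor += 1
        return row

    def observe(self, rollouts: list[dict]) -> None:
        """Default feedback hook: intentionally a no-op."""


class WeightedRoundRobin:
    """Pick the next env name with probability proportional to its weight."""

    def __init__(self, weights: dict[str, float], seed: int = 0) -> None:
        self.names = list(weights)
        self.weights = [weights[name] for name in self.names]
        self.rng = random.Random(seed)

    def next_env(self) -> str:
        return self.rng.choices(self.names, weights=self.weights, k=1)[0]


class TrainSource:
    """Pick an env via the mix strategy, then pull from that env's sampler."""

    def __init__(self, samplers: dict[str, ShuffledCursorSampler], mix: WeightedRoundRobin) -> None:
        self.samplers = samplers
        self.mix = mix

    def next_example(self) -> tuple[str, dict]:
        env_name = self.mix.next_env()
        return env_name, self.samplers[env_name].next()


samplers = {
    "rt-grpo": ShuffledCursorSampler([{"id": i} for i in range(4)], seed=1),
    "rt-lenpen": ShuffledCursorSampler([{"id": i} for i in range(4)], seed=2),
}
source = TrainSource(samplers, WeightedRoundRobin({"rt-grpo": 2.0, "rt-lenpen": 1.0}, seed=3))
env_name, example = source.next_example()
# The sink side would later call the same sampler's feedback hook:
samplers[env_name].observe([{"example_id": example["id"], "reward": 0.0}])
```

Keeping the sampler reachable from both sides is what lets the sink feed results back without the source knowing which strategy is installed.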

Behavior

Behavior-equivalent to before: a weighted round-robin over per-env datasets that are each shuffled once and walked with a reshuffling cursor. The default observe() is a no-op, so the feedback wire does not change default runs. (RNG is now partitioned into one generator per env plus one for the mix, rather than one shared generator, so the exact example sequence for a given seed differs from before. The sampling distribution is unchanged, and nothing depends on the old ordering.)
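One way to picture the RNG partitioning is a set of independent generators derived from a single base seed. The sketch below is an assumption about the general technique, not the derivation this PR uses: the `partition_rngs` helper and the string-based seed scheme are hypothetical.

```python
import random


def partition_rngs(
    seed: int, env_names: list[str]
) -> tuple[dict[str, random.Random], random.Random]:
    """Derive one generator per env plus one for the env mix from a base seed.

    String seeds are hashed deterministically by random.Random, so the
    streams are reproducible across runs and independent of each other.
    """
    per_env = {name: random.Random(f"{seed}:env:{name}") for name in env_names}
    mix = random.Random(f"{seed}:mix")
    return per_env, mix


per_env, mix = partition_rngs(42, ["rt-grpo", "rt-lenpen"])

# Drawing from the mix generator does not advance any per-env stream.
for _ in range(100):
    mix.random()
print(per_env["rt-grpo"].random())
```

With a single shared generator, every env pick would shift the sequence each env sees; with partitioned generators, adding an env or changing the mix leaves the other envs' example order for a given seed untouched.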

Testing

- tests/unit/orchestrator/test_sampling.py (new, 8 tests): cursor cycles without repeats then reshuffles, determinism per seed, empty-rows guard, observe no-op, weighted-RR distribution + determinism.
- `ruff check` and `ruff format --check` are clean; existing test_advantage.py (17) + test_configs.py (106) still pass.
- Validated end-to-end on 2× RTX PRO 6000 (Blackwell): a 3-step multi-env reverse_text RL run with two envs (rt-grpo, rt-lenpen). Both envs were sampled every step through the new EnvMixStrategy + per-env samplers (varying ratios), and the run trained cleanly (Error 0.0%, exit 0), with the observe() wire firing per group.

🤖 Generated with Claude Code

Introduce the per-env sampling seam. Each train env owns a `SampleStrategy` (what example to serve, plus an `observe()` feedback hook); env selection is delegated to a swappable `EnvMixStrategy`. Defaults reproduce today's behavior (weighted round-robin over per-env reshuffling-cursor datasets).

- `orchestrator/sampling.py` (new): SampleStrategy + ShuffledCursorSampler; EnvMixStrategy + WeightedRoundRobin.
- TrainEnv owns its dataset via `build_sampler()` and holds `.sampler`.
- TrainSource slims to env-mix + per-env samplers.
- TrainSink.process_group calls `env.sampler.observe(survivors)` after advantages (no-op default): the feedback wire for curriculum / replay samplers.

Behavior-equivalent; RNG partitioned per-env + mix. Stacked on feat/per-env-advantage.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

hallerite wants to merge 1 commit into feat/per-env-advantage from feat/per-env-sampler
